feat: add OLMES variant of BigCodeBench#184

Open
tfburns wants to merge 6 commits into main from big_code_bench

Collaborator

@tfburns commented Feb 26, 2026

PR Checklist

  • Use descriptive commit messages.
  • Provide tests for your changes.
  • Update any related documentation and include any relevant screenshots.
  • Check if changes need to be made to docs (README or any guides in /docs/).

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Description

Adds a variant of the BigCodeBench task which mimics the OLMES implementation.

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests
    have not been included
  • I need help with writing tests

@tfburns tfburns marked this pull request as ready for review February 26, 2026 14:41
- assert num_fewshot == 0, "Fewshot is not supported for BigCodeBench"
+ # Only the base BigCodeBench class disallows fewshot; subclasses (e.g. BigCodeBench_OLMES) may use it.
+ if self.__class__ is BigCodeBench and num_fewshot != 0:
+     raise ValueError("Fewshot is not supported for BigCodeBench; use BigCodeBench_OLMES for 3-shot.")
Contributor

Why should this be an error?

Collaborator Author

Changed the raise ValueError to a logger.warning that logs the requested value and resets num_fewshot to 0, which matches the existing behavior of our BigCodeBench task. I'm adding the warning here to avoid user confusion, since BigCodeBench_OLMES, by contrast, is the variant intended for 3-shot runs.
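
For anyone following the thread, here is a minimal sketch of the warn-and-reset behavior described in this reply. It is not the PR's actual code: the constructor signature, the `num_fewshot` attribute, and the empty subclass body are assumptions made for illustration. Only the class names and the `self.__class__ is BigCodeBench` check come from the diff above.

```python
import logging

logger = logging.getLogger(__name__)


class BigCodeBench:
    """Hypothetical stand-in for the base task; only the fewshot handling is sketched."""

    def __init__(self, num_fewshot: int = 0) -> None:
        # Only the base class forces zero-shot; subclasses keep the requested value.
        if self.__class__ is BigCodeBench and num_fewshot != 0:
            logger.warning(
                "Fewshot is not supported for BigCodeBench "
                "(requested num_fewshot=%d); resetting to 0.",
                num_fewshot,
            )
            num_fewshot = 0
        self.num_fewshot = num_fewshot


class BigCodeBench_OLMES(BigCodeBench):
    """Hypothetical subclass; it inherits __init__ and is not reset by the check."""
```

The identity check (`is`) rather than `isinstance` is what lets a subclass pass through untouched, since `isinstance` would also match subclass instances.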

def _get_fewshot_target_text(self, item: dict[str, Any]) -> str:
    # Match oe_eval doc_to_target for complete: canonical_solution + "\\n```"
    target = item["canonical_solution"]
    assert target is not None and isinstance(target, str)
Contributor

Ideally, raise a ValueError as asserts can be turned off globally

Collaborator Author

Replaced this with an explicit if not isinstance(target, str): raise ValueError(...).
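
A minimal sketch of that replacement, for context. The function below is a hypothetical free-standing version of the method, the error message wording is an assumption, and the closing-fence suffix mentioned in the code comment above is left out to keep the focus on validation. The point is that `assert` statements are skipped when Python runs with `-O`, whereas an explicit `raise` always executes.

```python
from typing import Any


def get_fewshot_target_text(item: dict[str, Any]) -> str:
    """Hypothetical stand-alone version of the method under review."""
    target = item.get("canonical_solution")
    # An explicit check survives `python -O`, which strips assert statements.
    if not isinstance(target, str):
        raise ValueError(
            f"Expected 'canonical_solution' to be a str, got {type(target).__name__}"
        )
    return target
```

The `isinstance` check also covers the `None` case from the original assert, so a separate `is not None` test is not needed.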


test_code = r"""
import unittest
class TestCases(unittest.TestCase):
Contributor

@prabhuteja12 commented Feb 26, 2026

Would you be able to rename these and add a description of what they are actually testing? I'm not sure why these tests use unittest while the rest of the repo uses pytest.

Collaborator Author

Renamed these test methods to be a bit more descriptive and added docstrings explaining that the unittest code in the test data strings reflects BigCodeBench's format, not our repo's test framework.

Those strings get passed to execute_python_code_with_tests(), which sends them to a Docker container where they are written to a file and run as a separate Python process. The import unittest happens inside the container's Python interpreter, not in the repo test runner's process.
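
To illustrate the process separation, here is a rough sketch that writes a test string to a file and runs it in a separate interpreter. It uses a local subprocess in place of the Docker container and is not how execute_python_code_with_tests() is implemented; `run_test_code_in_subprocess`, `SAMPLE_TEST_CODE`, and the `test_script` module name are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical test payload in BigCodeBench's unittest style.
SAMPLE_TEST_CODE = """
import unittest

class TestCases(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)
"""


def run_test_code_in_subprocess(
    test_code: str, timeout: float = 30.0
) -> subprocess.CompletedProcess:
    """Write test_code to a temp file and run it with unittest in a child process."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "test_script.py").write_text(test_code, encoding="utf-8")
        # The child interpreter imports unittest; the calling process never does.
        return subprocess.run(
            [sys.executable, "-m", "unittest", "test_script"],
            cwd=tmp,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
```

Calling `run_test_code_in_subprocess(SAMPLE_TEST_CODE)` returns a `CompletedProcess` whose `returncode` is 0 when every test passes, so the outer pytest suite only ever inspects the result and never loads the unittest classes itself.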
